# LLM as a Judge Critic (PART 2)
## Objective

This notebook demonstrates how to use structured outputs from OpenAI's GPT-4o-mini model to label data in climate-related research papers. The task involves analyzing academic texts to identify and classify mentions of datasets while keeping context consistent across pages.


## Workflow

**PDF Text Extraction:**
   * Use PyMuPDF to extract pages from PDF documents.
   * Prefilter document pages using an HF-trained model.

**Weakly Supervised Data Labeling:**
   * Use the GPT-4o-mini model with a customized prompt for structured data extraction.

**LLM as a Judge (Validation & Error Correction):**
   * Use an LLM to validate extracted dataset mentions.
   * Correct or remove errors in dataset identification.
   * Filter only **valid dataset mentions (`valid: true`)**, discarding invalid entries.

**Autonomous Reasoning Agent:**
   * Use a reasoning pipeline to validate the LLM-as-a-judge output.

**Next Steps:**
   * Scale this into a batch processing of multiple files / directory of research papers.
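
The batch-processing step listed under Next Steps could be sketched roughly as follows. This is a minimal sketch and not part of the original notebook: `process_directory` and its `process_fn` parameter are hypothetical names, and it assumes every `.json` file in the directory has the `source`/`pages` structure used later in this notebook.

```python
import json
from pathlib import Path
from typing import Callable, List


def process_directory(input_dir: str, process_fn: Callable[[dict], object]) -> List[str]:
    """Apply process_fn to every JSON file in input_dir (hypothetical helper).

    Returns the names of the files that failed so they can be retried later.
    """
    failed = []
    for path in sorted(Path(input_dir).glob("*.json")):
        try:
            with open(path, "r", encoding="utf-8") as infile:
                document = json.load(infile)
            process_fn(document)
        except Exception as exc:  # keep going if a single file fails
            print(f"Skipping {path.name}: {exc}")
            failed.append(path.name)
    return failed
```

With the functions defined later in this notebook, a call might look like `process_directory("output/extracted_data", process_judge_validation)`. Collecting failures instead of raising keeps one malformed file from stopping the whole batch.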


## Install Required Packages

In [1]:
%%capture
!pip install pymupdf openai nltk scikit-learn python-dotenv

## LLM-as-a-Judge for Quality Assessment

After the initial extraction of dataset mentions, we validate its output with an LLM-as-a-judge pipeline.

In [1]:
import json
import os
from tqdm.auto import tqdm
from pydantic import BaseModel, Field, ValidationError
from typing import List, Optional
from enum import Enum


# Define Enums for categorical fields
class Context(str, Enum):
    background = "background"
    supporting = "supporting"
    primary = "primary"


class Specificity(str, Enum):
    properly_named = "properly_named"
    descriptive_but_unnamed = "descriptive_but_unnamed"
    vague_generic = "vague_generic"


class Relevance(str, Enum):
    directly_relevant = "directly_relevant"
    indirectly_relevant = "indirectly_relevant"
    not_relevant = "not_relevant"


class DatasetEntry(BaseModel):
    raw_name: Optional[str] = Field(
        ..., description="The exact dataset name as it appears in the text."
    )
    harmonized_name: Optional[str] = Field(
        None, description="The standardized or full name of the dataset."
    )
    acronym: Optional[str] = Field(
        None, description="The short name or acronym associated with the dataset."
    )
    context: Context
    specificity: Specificity
    relevance: Relevance
    mentioned_in: Optional[str] = Field(
        None, description="The exact text excerpt where the dataset is mentioned."
    )
    producer: Optional[str] = Field(
        None, description="The organization responsible for producing the dataset."
    )
    data_type: Optional[str] = Field(
        None, description="The type of data represented by the dataset."
    )


class LabelledResponseFormat(BaseModel):
    dataset: List[DatasetEntry] = Field(
        ..., description="A list of datasets mentioned in the paper."
    )
    dataset_used: bool = Field(
        ..., description="A boolean indicating if a dataset is used in the paper."
    )

In [2]:
# Create a pydantic model for the judge response
from pydantic import model_validator


class JudgeDatasetEntry(BaseModel):
    raw_name: Optional[str] = Field(
        ..., description="The exact dataset name as it appears in the text."
    )
    harmonized_name: Optional[str] = Field(
        None, description="The standardized or full name of the dataset."
    )
    acronym: Optional[str] = Field(
        None, description="The short name or acronym associated with the dataset."
    )
    context: Context
    specificity: Specificity
    relevance: Relevance
    producer: Optional[str] = Field(
        None, description="The organization responsible for producing the dataset."
    )
    data_type: Optional[str] = Field(
        None, description="The type of data represented by the dataset."
    )
    year: Optional[str] = Field(
        None,
        description="The year associated with the dataset, if explicitly mentioned.",
    )
    valid: bool = Field(
        ..., description="True if the mention is valid, false otherwise."
    )
    invalid_reason: Optional[str] = Field(
        None, description="Reason why the mention was invalid (if applicable)."
    )
    sent: Optional[str] = Field(
        None, description="The exact sentence where the dataset is mentioned."
    )
    # entities: Optional[EmpiricalMention] = Field(None, description="Additional empirical context for the dataset.")

    # Validator to ensure valid and invalid_reason consistency.
    # In "after" mode the validator runs on the constructed instance,
    # so it is written as an instance method that returns self.
    @model_validator(mode="after")
    def check_validity(self):
        if not self.valid and not self.invalid_reason:
            raise ValueError("If 'valid' is False, 'invalid_reason' must be provided.")
        return self


class JudgeDatasetGroup(BaseModel):
    mentioned_in: Optional[str] = Field(
        None, description="The exact text excerpt where the dataset is mentioned."
    )
    datasets: List[JudgeDatasetEntry] = Field(
        ..., description="A list of validated datasets mentioned in the paper."
    )


class JudgeResponseFormat(BaseModel):
    page_number: int = Field(..., description="The page number in the document.")
    dataset_used: bool = Field(
        ...,
        description="Flag indicating whether a valid dataset is mentioned in the page.",
    )
    data_mentions: List[JudgeDatasetGroup] = Field(
        ...,
        description="A list of structured dataset information mentioned in the paper.",
    )

In [3]:
# judge prompt
JUDGE_PROMPT = """You are an expert in dataset validation. Your task is to assess whether each dataset mention is **valid, invalid, or requires clarification**, ensuring correctness and consistency based on the dataset's **empirical context**.

---

### **Dataset Validation Criteria**
A dataset is **valid** if:
1. **It is structured**—collected systematically for research, policy, or administrative purposes.
2. **It is reproducible**—meaning it consists of collected records rather than being derived purely from computations or models.

**Always Valid Datasets:**
- Government statistical and geospatial datasets (e.g., census, official land records).  
- Official surveys, administrative records, economic transaction data, and scientific research datasets.  

**Invalid Datasets:**
Mark as invalid every `"raw_name"` that falls under one of the following classes:
- Derived indicators or computational constructs (e.g., "wealth score", "mine dummy", "district total production").  
- Standalone statistical metrics without a clear underlying dataset (e.g., "average income growth rate" without source data).  
- General organizations, reports, or methodologies (e.g., "World Bank", "UNDP Report", "machine learning model").  

**Uncertain Cases:**
- If a dataset is **vaguely named but potentially valid**, set it as valid but return: `"Potentially valid—needs dataset name confirmation."`  
- If a dataset reference is **too generic** (e.g., `"time-varying data on production"`), set it as valid but return: `"Needs clarification—dataset name is too generic."`  

---

### **Key Validation Rules**
1. **Consistency Check:**  
   - If a `"raw_name"` has been marked **valid earlier**, it **must remain valid** unless its meaning significantly differs in a new context.

2. **Context-Aware Inference:**  
   - If details such as the **Year**, **Producer**, or **Data Type** are missing, try to extract them from the `mentioned_in` field, provided they appear there and clearly relate to the dataset.

3. **Data Type Classification (Flexible & Adaptive):**  
   - Infer the most appropriate `"data_type"` dynamically from context.  
   - Possible types: **Surveys, geospatial data, administrative records, financial reports, research datasets, climate observations, etc.**  
   - If **no predefined category fits**, create a **new `"data_type"` that best describes the dataset.**  

4. **Producer Identification:**  
   - If the **producer (organization/institution) is explicitly mentioned**, extract it.  
   - If not mentioned, **do not infer; set `"producer": null` instead.**  

---

### **JudgeResponseFormat Schema**
Each dataset assessment must conform strictly to the JudgeResponseFormat schema."""
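
The Consistency Check rule in the prompt is only an instruction to the model. Because each page is sent to the judge separately, nothing in the code itself enforces it. A post-hoc check along the following lines could flag names that received conflicting verdicts. This is a sketch that assumes the validated pages follow the `JudgeResponseFormat` structure; `find_inconsistent_names` is a hypothetical helper, not part of the original notebook.

```python
from collections import defaultdict


def find_inconsistent_names(pages):
    """Return raw_names judged both valid and invalid across pages.

    Maps each conflicting raw_name to the page numbers where it appears,
    in the order the pages were encountered.
    """
    verdicts = defaultdict(set)   # raw_name -> set of verdicts seen (True/False)
    seen_on = defaultdict(list)   # raw_name -> page numbers, first-seen order
    for page in pages:
        page_no = page.get("page")
        for group in page.get("data_mentions", []):
            for entry in group.get("datasets", []):
                name = entry.get("raw_name")
                if name is None or "valid" not in entry:
                    continue  # skip entries the judge has not assessed
                verdicts[name].add(bool(entry["valid"]))
                if page_no not in seen_on[name]:
                    seen_on[name].append(page_no)
    # A name is inconsistent when both True and False verdicts were recorded
    return {name: seen_on[name] for name, seen in verdicts.items() if len(seen) > 1}
```

Names returned by this check could be reviewed manually or re-submitted to the judge together with their earlier verdicts.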

In [4]:
def validate_with_llm_judge(page):
    """
    Validate dataset mentions using LLM-as-a-judge with structured outputs.

    Parameters:
        page (dict): A single page's data from the extracted data JSON.

    Returns:
        dict: The page with validated dataset mentions or None if an error occurs.
    """
    # Prepare the input for LLM
    input_data = {
        "page_number": page.get("page"),
        "data_mentions": page.get("data_mentions", []),
    }

    # Skip validation if there are no data mentions
    if not input_data["data_mentions"]:
        return None

    # Prepare messages for the LLM
    messages = [
        {"role": "system", "content": JUDGE_PROMPT},
        {"role": "user", "content": f"{json.dumps(input_data, indent=2)}"},
    ]

    try:
        completion = client.beta.chat.completions.parse(
            model=MODEL,
            messages=messages,
            temperature=0.2,
            response_format=JudgeResponseFormat,
        )

        # Validate and parse the LLM's structured response
        parsed_data = completion.choices[0].message.parsed

        # Update the page with validated mentions
        page["dataset_used"] = parsed_data.dataset_used
        page["data_mentions"] = [
            mention.model_dump() for mention in parsed_data.data_mentions
        ]

        return page

    except ValidationError as ve:
        print(f"Validation error on page {page.get('page')}: {ve}")
        return None
    except Exception as e:
        print(f"Error validating page {page.get('page')}: {e}")
        return None

In [5]:
def process_judge_validation(input_json):
    """
    Process the entire JSON file with LLM-as-a-judge for validation.

    Parameters:
        input_json (dict): The JSON structure containing the source and pages.

    Returns:
        dict: The updated JSON structure with validated pages.
    """
    pages = input_json.get("pages", [])

    # Process each page in the JSON file
    for page_idx, page in tqdm(enumerate(pages), desc="Processing pages"):
        # Only pages with data_mentions are sent to the judge
        if not page.get("data_mentions"):
            continue
        validated_page = validate_with_llm_judge(page)
        if validated_page:
            # page_idx is the page's position in the list, so this
            # replaces the same entry that was just validated
            pages[page_idx] = validated_page

    output_path = "./output/llm_judge_validation"
    os.makedirs(output_path, exist_ok=True)
    output_file_path = os.path.join(output_path, f"{input_json['source']}.json")
    # Save the updated JSON file with validated pages
    with open(output_file_path, "w", encoding="utf-8") as outfile:
        json.dump(input_json, outfile, indent=4)

    # Return the updated structure, as promised in the docstring
    return input_json

In [13]:
# Output from the previous step
input_file_path = "output/extracted_data/The-local-socioeconomic-effects-of-gold-mining-evidence-from-Ghana.json"
with open(input_file_path, "r") as infile:
    input_data = json.load(infile)

In [7]:
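import os

from dotenv import load_dotenv
from openai import OpenAI

# Load environment variables (e.g. OPENAI_API_KEY) from a .env file, if present
load_dotenv()

# Fall back to a placeholder if the variable is not set
API_KEY = os.getenv("OPENAI_API_KEY", "YOUR_API_KEY")
MODEL = "gpt-4o-mini"
client = OpenAI(api_key=API_KEY)  # initialize the client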

In [8]:
# inspect input

input_data.get("pages")

[{'page': 4,
  'dataset_used': True,
  'data_mentions': [{'mentioned_in': 'We also allow for spillovers across \ndistricts, in a district-level analysis. We use two complementary geocoded household data sets \nto analyze outcomes in Ghana: the Demographic and Health Survey (DHS) and the Ghana \nLiving Standard Survey (GLSS), which provide information on a wide range of welfare \noutcomes. The paper contributes to the growing literature on the local effects of mining.',
    'datasets': [{'raw_name': 'Demographic and Health Survey (DHS)',
      'harmonized_name': 'Demographic and Health Survey (DHS)',
      'acronym': 'DHS',
      'context': 'primary',
      'specificity': 'properly_named',
      'relevance': 'directly_relevant',
      'producer': None,
      'data_type': 'Surveys & Census Data',
      'sent': 'We use two complementary geocoded household data sets \nto analyze outcomes in Ghana: the Demographic and Health Survey (DHS) and the Ghana \nLiving Standard Survey (GLSS), which 

In [9]:
# uncomment to run
# process_judge_validation(input_data)

Processing pages: 0it [00:00, ?it/s]

Error validating page 11: Could not parse response content as the length limit was reached - CompletionUsage(completion_tokens=16384, prompt_tokens=1916, total_tokens=18300, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=1152))
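
The error above indicates that the judge's response for page 11 reached the completion token limit before the JSON was complete, so the page was left unvalidated. One possible mitigation, which is an assumption and not something the original notebook does, is to send a page's `data_mentions` to the judge in smaller batches so that each response stays short. The helper name `batch_mentions` and the default batch size of 5 are illustrative.

```python
def batch_mentions(page, batch_size=5):
    """Yield copies of `page` that each carry at most batch_size data mentions."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    mentions = page.get("data_mentions", [])
    for start in range(0, len(mentions), batch_size):
        batch = dict(page)  # shallow copy so the original page is left untouched
        batch["data_mentions"] = mentions[start:start + batch_size]
        yield batch
```

Each batch could then be passed to `validate_with_llm_judge` and the returned `data_mentions` lists concatenated back into a single page. Working on copies matters here because `validate_with_llm_judge` modifies the page it receives.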


In [10]:
print("done!")

done!


In [14]:
# inspect the output of the validation

with open(
    "output/llm_judge_validation/The-local-socioeconomic-effects-of-gold-mining-evidence-from-Ghana.json",
    "r",
) as f:
    extracted_data = json.load(f)
    print(json.dumps(extracted_data, ensure_ascii=False, indent=2))

{
  "source": "The-local-socioeconomic-effects-of-gold-mining-evidence-from-Ghana",
  "pages": [
    {
      "page": 4,
      "dataset_used": true,
      "data_mentions": [
        {
          "mentioned_in": "We also allow for spillovers across \ndistricts, in a district-level analysis. We use two complementary geocoded household data sets \nto analyze outcomes in Ghana: the Demographic and Health Survey (DHS) and the Ghana \nLiving Standard Survey (GLSS), which provide information on a wide range of welfare \noutcomes. The paper contributes to the growing literature on the local effects of mining.",
          "datasets": [
            {
              "raw_name": "Demographic and Health Survey (DHS)",
              "harmonized_name": "Demographic and Health Survey (DHS)",
              "acronym": "DHS",
              "context": "primary",
              "specificity": "properly_named",
              "relevance": "directly_relevant",
              "producer": null,
              "data_t

In this step, we ran the LLM-as-a-judge model to validate the results of the weakly supervised labeling step.

### Next Step

The output from this step will be processed by the Autonomous Reasoning Agent to validate and refine the extracted dataset mentions, ensuring their quality and correctness.
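
Before handing the output on, the workflow's filtering step (keeping only `valid: true` mentions) could be sketched as below. This is a minimal sketch, not the notebook's own implementation: `keep_valid_mentions` is a hypothetical helper, and recomputing `dataset_used` from the remaining mentions is an assumption based on that field's description in `JudgeResponseFormat`.

```python
def keep_valid_mentions(pages):
    """Return a copy of pages keeping only dataset entries with valid == True."""
    filtered_pages = []
    for page in pages:
        groups = []
        for group in page.get("data_mentions", []):
            valid_entries = [
                entry
                for entry in group.get("datasets", [])
                if entry.get("valid") is True
            ]
            if valid_entries:  # drop groups left with no valid dataset
                groups.append({**group, "datasets": valid_entries})
        # Assumption: a page "uses" a dataset only if a valid mention remains
        filtered_pages.append(
            {**page, "data_mentions": groups, "dataset_used": bool(groups)}
        )
    return filtered_pages
```

For example, `keep_valid_mentions(extracted_data["pages"])` would return filtered pages without modifying `extracted_data` itself.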